Evaluating High Level Parallel Programming Support for Irregular Applications in ICC++
Abstract
Object-oriented techniques have been proffered as aids for managing complexity, enhancing reuse, and improving readability of irregular parallel applications. However, as performance is the major reason for employing parallelism, programmability and high performance must be delivered together. Using a suite of seven challenging irregular applications and the mature Illinois Concert system (a high-level concurrent object-oriented programming model backed by an aggressive implementation), we evaluate what programming efforts are required to achieve high performance. For all seven applications, we achieve performance comparable to the best reported for low-level programming means on large-scale parallel systems. In general, a high-level concurrent object-oriented programming model supported by aggressive implementation techniques can eliminate programmer management of many concerns: procedure and computation granularity, namespace management, and low-level concurrency management. Our study indicates that these concerns are fully automated for these applications. Decoupling these concerns makes managing the remaining fundamental concerns, data locality and load balance, much easier. In several cases, data locality and load balance for the complex algorithms and pointer data structures are automatically managed by the compiler and runtime, but in general programmer intervention was required. In a few cases, more detailed control is required, specifically explicit task priority, data consistency, and task placement. Our system integrates the expression of such information cleanly into the programming interface. Finally, only small changes to the sequential code were required to express concurrency and performance optimizations; less than 5% of the source code lines were changed in all cases. This bodes well for supporting both sequential and parallel performance in a single code base.

keywords: concurrent object-oriented programming, high-level parallel models

1 Overview

Object-oriented techniques have been proffered as aids for managing complexity, enhancing reuse, and improving readability of parallel applications. Numerous variants have been explored (e.g. [40]) with a great volume of technical papers published. Concurrent object systems appear particularly promising for irregular applications, where modularity and encapsulation help manage the complex data and parallelism structures. In this paper, we systematically evaluate a high-level concurrent object system's programmability for achieving high performance on irregular applications. While significant bodies of research exist both for implementing irregular parallel applications efficiently and for concurrent object-oriented programming, we know of no studies that address both together.

While there is a wealth of other research on programming irregular parallel applications, none of these approaches provide a single programming interface while supporting all three dimensions of irregularity: data, control, and concurrency. Research efforts have largely focused on building special libraries which provide runtime support for one or several dimensions of irregularity. For example, the inspector-executor approach uses a library to support iterated irregular communication (such as in finite element simulations) [19]. The Multipol library provides a set of tools to build irregular applications, but no uniform high-level interface [39].
Libraries such as Sparspak, Kelp, and A++/P++ support particular classes of algorithms on sparse matrices and adaptive meshes [11, 27]. Of the many concurrent object-oriented programming systems, nearly all have chosen to ignore support for fine-grained parallelism. This greatly complicates the task of programming irregular applications by forcing programmers to map irregular concurrency into grains large enough to achieve efficiency. For example, systems such as pC++ [24], CC++ [7] (these two languages have been combined into HPC++), and Mentat [16] (now a basis for the Legion project) all use heavyweight threads. Among the few systems which have explored fine-grained concurrency (ABCL/f [43], MPC++, and the Illinois Concert system), ours is the first systematic evaluation of programming effort for a significant range of irregular applications.

Our application study is done in the context of the Illinois Concert system, a high-performance compiler and runtime for parallel computers which has been the vehicle for extensive research on compiler optimization and runtime techniques over the past five years [9, 32, 29, 30, 33, 28, 31, 12, 44, 23, 22]. While no system contains all known optimizations, the Concert system contains a wide range of aggressive optimizations, and has been used to demonstrate high performance in absolute terms on a wide range of applications [23, 45, 10]. In effect, the Concert system automatically addresses many of the concerns which programmers explicitly manage in lower-level programming models (e.g. explicit threads or message passing).

Using a rich suite of irregular parallel applications and the mature Concert system, we evaluate the programming effort for the following irregular parallel computations: graphics (polygon overlay and hierarchical radiosity), N-body codes (Barnes-Hut and the fast multipole method), adaptive mesh codes (structured adaptive mesh refinement), and symbolic computation (Gröbner basis and phylogeny). For each application, delivered performance is close to the best possible. Detailed descriptions of the parallelization and optimization steps provide insight into the concerns the programming system (model and implementation) succeeded in automating, and those that remained. In addition, such descriptions also provide insight into the key aspects of parallelizing each application.

Our experience indicates that a high-level concurrent object-oriented programming model and a sophisticated implementation can eliminate many concerns, easing the programming of irregular parallel applications. First, despite the fact that the underlying machines are distributed memories, programmers are freed from explicit namespace management, computation granularity (procedure and data), and low-level memory race management. Second, while in most cases programmers still needed to express application-specific data locality and load balance, the orthogonal framework for functionality and data placement eased this effort by allowing radical changes in data layout with modest code changes. While the decomposition of computation and data layout is similar conceptually to that in HPF [13], the decompositions used for our applications are irregular. Finally, while quantitative metrics are only moderately indicative of programming effort, they nonetheless provide a tangible point of comparison. Over the seven applications only modest code changes (less than 5% of the lines) were required, mostly to express the careful data locality and load balance needed to achieve top performance.
Of course, given the diversity of applications studied, several required more detailed control over data locality and load balance. In applications employing heuristic methods, task ordering control is required [21]. And, in several applications, explicit thread placement (complementary to data placement and owner-computes) was a more convenient perspective. Finally, the radiosity, Gröbner, and phylogeny applications benefited from annotations for relaxed data consistency.

The remainder of the paper is organized as follows: Section 2 details the key elements of the programming interface, its implementation, and the irregular application suite. Section 3 presents our methodology, used in Section 4 to describe the algorithm and data structures of each application and to summarize the efforts required to achieve high parallel performance. Quantitative performance data and an illustration of the performance impact of each type of program transformation are shown in Section 5. We conclude with a discussion and summary in Sections 6 and 7.

2 Background

In evaluating a programming approach, our goal is to identify the concerns which must be addressed by a programmer to achieve good performance and how those concerns are managed. Any such experimental evaluation must be done in the context of an actual programming system with its attendant strengths and shortcomings. Finally, the experiments must involve a community of programmers and particular application programs. We describe in turn the basic programming interface, interfaces for performance optimization, and the compiler and runtime technology of the Illinois Concert system. We conclude by discussing the application programs used in the study (i.e. the workload).

2.1 Basic Programming Interface

Our programming interface is a basic object-oriented programming system augmented with simple extensions for concurrency. Thus, programmers can take sequential programs and move them into a parallel environment with little or no change. Then, concurrency can be added as appropriate. The basic model is based on C++, providing a single namespace, scalars and objects, and single inheritance. For further details on the programming model (Illinois Concert C++ or ICC++), see [41].

Simple extensions for concurrency include annotating standard blocks (i.e. compound statements) and loops with the conc keyword. A conc block defines a partial order amongst its statements, allowing concurrency while preserving local data dependences. Thus, conc can often be applied to existing blocks. Also, conc indicates non-binding concurrency, allowing programmers to specify concurrency and depend on the underlying system to create threads as required. A simple semantics for object concurrency provides synchronization: method invocations on an object execute as if they had exclusive access. The state of the object is sequentially consistent; however, operations across multiple objects may not be. This raises the level of consistency from single memory locations to user-defined objects.
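To make these extensions concrete, the following fragment annotates a loop and a block from a hypothetical sequential code, in the style of the application fragments shown in Section 4; the names (cells, relax, left, right, update) are illustrative, not code from the application suite.

    // Non-binding concurrency: each iteration may run as a separate
    // thread, or the system may run them sequentially if that is cheaper.
    conc for (i = 0; i < num_cells; i++)
      cells[i].relax();       // object-level concurrency control makes each
                              // invocation appear to have exclusive access

    conc {                    // conc block: a partial order over its statements,
      left->update(x);        // preserving local data dependences while
      right->update(y);       // allowing independent calls to overlap
    }

The point of the non-binding semantics is that such annotations can be applied liberally: they expose concurrency without obligating the implementation to pay thread creation costs.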
2.2 Performance Optimization Interfaces

Data locality and load balance are managed through data placement and colocation, data consistency, and thread placement interfaces. Data locality is essential for high performance on each processor, and load balance is critical for achieving high overall application performance. Data placement and colocation are managed through collections: distributed arrays of objects for which the user can specify standard layouts (e.g. block, cyclic) or full custom distributions. Between individual objects, colocation control can be expressed at object allocation time. Combining these two mechanisms, the placement of any individual object in the system can be controlled.

Three additional interfaces support control over data consistency, thread placement, and task ordering. Whereas the data placement interfaces were found useful in all applications, these additional interfaces are only required by a few of the applications.

The data consistency interface allows the Concert system to use relaxed consistency models for higher performance. This interface was necessary for two applications, and is expressed using a with_local annotation on blocks. As shown in the following code fragment, the with_local annotation specifies a set of objects that need to be localized and provides information about access semantics. In this example, the annotation declares that access to objects obj1 and obj2 within the block is read-only, allowing the system to cache them using efficient protocols.

    function( obj1, obj2, ... ) {
      with_local( obj1, READ_ONLY, obj2, READ_ONLY ) {
        /* use obj1 and obj2 here */
        ... = func( obj1, obj2, ... );
      }
      ...
    }

Thread placement interfaces provide support for controlling data locality and load balance from the alternative perspective of threads rather than objects. Thread placement is specified by annotating the thread creation using the !target keyword, which takes an integer argument. In the following code fragment, the thread corresponding to thread_function is executed on a random processor.

    ...
    ( thread_function( arg1, arg2, ... ) !target rand() );
    ...

Task ordering is useful in applications based on search which prioritize promising tasks for higher efficiency. Task ordering is specified by annotating the thread creation site using the !priority keyword, which takes the task priority as an argument. Additionally, the interface permits the programmer to specify a custom scheduler for thread execution. The Concert runtime system implements a library of schedulers, such as sender-initiated scheduling, task stealing, and priority-based task stealing. The Concert runtime also defines an interface for the user to extend this library by attaching custom schedulers. In the following code fragment, the thread corresponding to thread_function executes with priority arg1 and is managed by a global priority scheduler (a custom scheduler).

    ...
    ( thread_function( arg1, arg2, ... ) !priority arg1 !scheduler GLOBAL_PRIORITY );
    ...

2.3 Illinois Concert System

Our application study is done in the context of the Illinois Concert system, a high-performance compiler and runtime for parallel computers which has been the vehicle for extensive research on compiler optimization and runtime techniques over the past five years [9, 32, 29, 30, 33, 28, 31, 12, 44, 23, 22]. While no system contains all known optimizations, the Concert system contains a wide range of them, and has been used to demonstrate high performance in absolute terms on a wide range of applications [23, 45, 10]. In effect, the Concert system automatically addresses many of the concerns which programmers must manage in lower-level programming models:

Procedure Granularity and Virtual Function Call Overhead: Aggressive interprocedural optimization and cloning chooses efficient granularities and essentially eliminates virtual function call overhead [28].
Compound Object Structuring: Object data inlining based on interprocedural analysis allows the Concert compiler to reorganize data structures with inline allocation, reducing storage management cost and eliminating associated pointer dereferences [12]. Resulting code efficiency can surpass hand-tuned C or C++ manual inline optimization.

Thread Granularity: Speculative stack-heap execution reduces cost by allowing threads to execute as procedure calls on the stack, lazily creating heap threads only as needed. Processor efficiency of 70% or greater at 10 µs thread granularity has been demonstrated [23].

Distributed Name Management: Concert provides a global namespace on both distributed memory and shared memory systems. Fast name translation and data movement (a few microseconds) enable efficient manipulation of remote names and data [23].

Initial Data Placement, Data Consistency, and Sharing (sometimes): A fast and flexible software object caching system can eliminate the importance of initial data placement for some applications. Compiler analyses can provide coarse sharing information, which in some cases allows automatic management of data consistency and prefetching. Naturally, in some cases explicit programmer management is required for both initial data placement and data consistency.

2.4 Applications

As concurrent object-oriented programming techniques were proposed to manage complex parallel applications, we have collected a suite of challenging irregular programs (see Table 1), each of which is irregular in data structure, parallel control structure, concurrency, or all three. In short, these are applications which are complex to program even on sequential systems, using sophisticated pointer-based data structures and efficient algorithms.

    Application                                     Description
    Polyover    computer graphics [41]              Computes the overlay of two polygon maps.
    Barnes-Hut  computational cosmology [42]        Computes the interaction of a system of particles in 3D
                                                    using the Barnes-Hut hierarchical N-body method.
    FMM         computational cosmology [42]        Computes the interaction of a system of particles in 2D
                                                    using the fast multipole method.
    SAMR        computational fluid dynamics [14]   Solves hyperbolic partial differential equations using
                                                    structured adaptive mesh refinement.
    Gröbner     computational algebra [6]           Computes a Gröbner basis.
    Phylogeny   evolutionary history [20]           Determines the evolutionary history of a set of species.
    Radiosity   computer graphics [42]              Computes global illumination using the hierarchical
                                                    radiosity method.

    Table 1: The Irregular Applications Suite

The complexity of programming for our applications implies that many are not in widespread use on scalable parallel machines. For those that are [42, 38], extraordinary programming using low-level interfaces has previously been required to achieve good performance. To our knowledge, ours is the first study to evaluate a high-level, general-purpose programming interface for such a challenging suite of parallel applications. Further, using Concert, we have achieved performance parity with the best known low-level coded results (where available) for each of these applications. Thus, the application summaries constitute the entirety of programmer optimizations required to obtain high performance.

3 Experimental Methodology

To achieve efficient parallel execution of an irregular application, there are several concerns that must be addressed by a programmer.
The approach we adopt in this study is to evaluate high-level programming approaches by first identifying these concerns in the context of individual applications, and then examining the extent to which various facets of the programming system help manage the concerns. To provide a common context for discussion, we describe some general concerns below. Transforming a sequential algorithm into an efficient parallel expression requires the programmer to address two high-level concerns: concurrency and synchronization specification, and data locality and load balance.

3.1 Concurrency and Synchronization Specification

The parallelism structure of the application is expressed using concurrency and synchronization specifications. Concurrency specifications relax the total order imposed by the sequential algorithm, producing a decomposition of the program control structures into tasks. Synchronization specifications ensure correct program execution; they specify task dependencies and consistency conditions on shared objects. The programmer may have to restructure the program to describe these specifications; the support provided in the underlying programming system determines the amount of restructuring. The high-level features of the Concert system considerably simplify the task of describing concurrency and synchronization specifications. The non-binding concurrency specification (which incurs no loss in efficiency due to the aggressive implementation) and object-level concurrency control permit the programmer to describe these specifications in a natural style, allowing the parallel program to retain the core structure of the sequential program.

3.2 Data Locality and Load Balance

The second high-level concern that the programmer must address is ensuring that the parallel expression executes with high performance. Data locality is essential for high performance on each processor, and load balance is critical for achieving high overall application performance.

The goal of data locality is to ensure that parallel tasks incur reduced communication to access data objects. Achieving this goal requires specifications about data distribution (mapping of objects across the processors) and data alignment (colocation of objects with respect to tasks or other objects). Given the dynamic nature of irregular applications, data locality may require the use of dynamic techniques such as caching. Specifications about precise data consistency semantics can enable the use of more efficient relaxed consistency protocols.

The goal of load balance is to ensure that the parallel application makes effective use of all the processors of the machine. Achieving this goal requires specifications about mapping tasks across the processors of the machine. Task mapping can be controlled either by specifying data distribution (and executing tasks local to the objects they access) or by specifying a task distribution. In addition, applications that rely on prioritized execution of tasks for efficiency require additional specifications about thread ordering.

Programmer effort required to ensure data locality and load balance depends both on the application structure and on the capabilities of the underlying system. In Concert, the shared global namespace provides efficient uniform naming, which allows locality and load balance to be expressed independent of program functionality and concurrency. In addition, a range of compiler and runtime techniques implicitly address the locality and load-balancing concerns.
As described in Section 2, Concert provides interfaces for specifying data alignment, distribution, and relaxed consistency semantics (for ensuring data locality), and data distribution, task placement, and task ordering (for ensuring load balance).

4 Application Case Studies

We illustrate the benefits of high-level programming features and support using the seven irregular applications as case studies. For each application, we first describe the algorithmic structure. We next describe the parallelization of the application, focusing on programmer effort associated with concurrency and synchronization specification, and with ensuring data locality and load balance. We summarize the description for each application by discussing the extent to which support in the Concert system helps manage the programmer concerns.

4.1 Barnes-Hut

The Barnes-Hut code computes the interactions of a system of particles in 3-D using the Barnes-Hut hierarchical N-body method [1].

4.1.1 Algorithmic Structure

The Barnes-Hut method exploits the physical insight that the force between two particles decreases with the square of their distance to reduce computational complexity. The primary data structure is an oct-tree that hierarchically divides the simulation space into cubes, called spatial cells. Leaf spatial cells in turn point to their contained particles. The spatial cells represent the aggregate center-of-mass and force information of their contained particles. Interactions between particles sufficiently far apart are summarized as particle-cell interactions, reducing the computational complexity from O(N^2) to O(N log N). However, the complexity reduction comes at the cost of more complex data structures. Particle spatial distributions can be highly non-uniform, producing highly unbalanced oct-trees. In addition to the floating point computations corresponding to interactions, traversing the oct-tree efficiently is also crucial for the performance of Barnes-Hut.

The computation proceeds in iterations. During each iteration, an oct-tree is first built given the updated positions of the particles. The tree is then used to compute the force exerted on each particle, and subsequently the velocities and positions of the particles are updated. The force computation phase dominates the computation, representing more than 95% of the sequential computation time. During the force computation phase, each particle traverses the global oct-tree recursively. However, each traversal is a partial traversal. At each cell, the distance between its center-of-mass and the current position of the particle is calculated. If the distance is larger than a cutoff, the traversal of the subtree terminates, and the interactions of the particle with all particles contained in the cell are computed as a particle-cell interaction. If the distance is within the cutoff, the traversal continues with the children of the current cell. Therefore, the data access pattern of each tree traversal is highly data-dependent and irregular.

4.1.2 Program Parallelization

Our parallel algorithm is derived from the Barnes-Hut application in the SPLASH-2 benchmark suite [42].

Concurrency and Synchronization Specification

There are two levels of parallelism in the dominant force computation phase. Oct-tree traversals for all particles are concurrent, and within a traversal, the traversals of subtrees are also concurrent.
Both levels of parallelism are naturally expressed in Concert using conc annotations on the corresponding code in the sequential version. The global namespace and default object sequential consistency provided by ICC++ ensure that no additional code is required for parallel tasks to access potentially remote objects, such as during traversals of the global oct-tree. The following code fragment illustrates the Barnes-Hut force computation phase.

    // conc keyword annotates concurrency
    conc for (i = 0; i < Num_particles; i++)
      particles[i].compute(ROOT, cutoff);
    ...
    compute(cell, cutoff) {
      if (distance(cell) < cutoff) {
        // close to the cell: open it and recurse over its children
        conc for (i = 0; i < NDIM; i++)
          if (cell->children[i] != NULL)
            compute(cell->children[i], cutoff/4.0);
      } else {
        // far from the cell: particle-cell interaction using aggregate data
        ... = cell->pos;
        ... = cell->mass;
        cost++;    // cost for history-based load balancing
      }
    }

Two conc annotations are needed for the two levels of parallelism: across separate tree traversals and across traversals of subtrees at each level. Besides the conc annotations, no additional code is required to manage name translation and ensure consistency when accessing and updating the global oct-tree. In addition, because the conc annotations are non-binding, they can be liberally used to expose concurrency while still allowing the compiler to either exploit the concurrency for parallelism or ignore it for sequential efficiency. In all, there are 64 lines (out of 2,631 lines) in the code with conc annotations.

Data Locality and Load Balance

Barnes-Hut exhibits both data-dependent irregular data access patterns and irregular computation granularity. The dominant computation phase consists of concurrent partial traversals of the oct-tree data structure, where both the portion of the tree traversed and the depth of traversal depend on a particle's spatial location. Because the computation is fine-grained (a single particle-particle or particle-cell interaction incurs about ten floating point operations while accessing two tree elements), good performance requires good data locality to optimize remote accesses and balancing the amount of work across processors.

We ensure data locality and load balance using a spatial partitioning of the tree data structure and a history-based load balancing algorithm, adopted from [35]. The spatial partitioning constructs a Peano-Hilbert ordering of all the particles according to their spatial coordinates. This linear ordering is then divided into blocks and assigned to processors. Because the Peano-Hilbert ordering inherently groups particles that are close together spatially, this assignment maximizes access locality and data reuse for accessing remote tree nodes. The history-based load balancing counts the number of interactions for each particle (incrementing the cost variable in the above code example) and uses these counts as weights to assign particles to processors for the next iterative step. The number of interactions in the previous iteration is a good estimate of the amount of future work because particles move slowly during simulations.

We use the data placement interface in the Concert system to achieve the assignment of particles to processors. Given the default owner-computes policy of Concert, separate code to distribute the tree handles both load balancing and initial placement. The computation per se is decoupled from data placement, enabling easy experimentation with data placement and load balancing policies. In all, the data placement and load balancing code consists of 33 line changes (out of 2,631 lines).
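To make the partitioning step concrete, the following sketch shows one way the cost-weighted block assignment described above could be computed. It is a plain C++ illustration under assumed data types; the names (Particle, hilbert_key, assign_particles, and so on) are hypothetical, and this is not the Concert code used in the study.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Hypothetical particle record: a space-filling-curve key plus the
    // interaction count ("cost") measured during the previous iteration.
    struct Particle {
        std::uint64_t hilbert_key;  // position encoded on a Peano-Hilbert curve (assumed precomputed)
        long          cost;         // interactions in the previous iteration
        int           owner;        // processor assignment produced below
    };

    // Assign contiguous blocks of the Hilbert-ordered particle list to
    // processors so that each block carries roughly equal total cost.
    void assign_particles(std::vector<Particle>& particles, int num_procs) {
        // 1. Order particles along the curve so that spatially close
        //    particles become neighbors in the list.
        std::sort(particles.begin(), particles.end(),
                  [](const Particle& a, const Particle& b) {
                      return a.hilbert_key < b.hilbert_key;
                  });

        // 2. Split the ordering into blocks of roughly equal total cost.
        long total = 0;
        for (const Particle& p : particles) total += p.cost;
        long target = (total + num_procs - 1) / num_procs;  // cost per processor

        long accumulated = 0;
        int  proc = 0;
        for (Particle& p : particles) {
            p.owner = proc;
            accumulated += p.cost;
            if (accumulated >= target && proc < num_procs - 1) {
                proc++;
                accumulated = 0;
            }
        }
    }

In the ICC++ version, an owner map of this kind would be handed to the data placement interface of Section 2.2, while the force computation code itself remains unchanged.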
4.1.3 Summary of Changes

Concurrency and synchronization specification in the Barnes-Hut code is straightforward, corresponding naturally to the program structure. The non-binding concurrency, global namespace, and object consistency support in ICC++ allow the expression of these specifications with minimal changes to the sequential version. Ensuring data locality and load balance requires more understanding of the application. The programmer needs to provide an initial data placement of particles to processors to achieve both good data locality and load balance. However, the data placement is decoupled from the algorithmic specification. In all, less than 5% of the lines require changes for the parallel version: 107 out of 2,631 lines.

4.2 FMM

The FMM code computes the interactions of a system of particles in 2-D using the fast multipole method (FMM), a hierarchical N-body method [15].

4.2.1 Algorithmic Structure

The FMM also exploits the physical insight that the interaction between two particles decreases with the square of their distance to reduce the computational complexity. The primary data structure is a quad-tree that hierarchically divides the simulation space into squares, called boxes. Leaf boxes in turn point to their contained particles. Particle-particle interactions within each box are computed directly. Each spatial box hierarchically summarizes its particle information in a multipole expansion. This expansion is then used to compute interactions between particles that do not lie within the same box through either particle-box or box-box interactions. The FMM has excellent error control and can reduce the computational complexity from O(N^2) to O(N). However, like the Barnes-Hut code, the complexity reduction comes at the cost of more complex data structures. Again, particle spatial distributions can be highly non-uniform, producing highly imbalanced quad-trees, non-uniform computation granularity, and irregular data access patterns.

The computation proceeds in iterations. During each iteration, a quad-tree is first built given the updated positions of the particles. The quad-tree is then traversed once to compute, for each box, four interaction lists of boxes. During the force computation phase, each box traverses its four interaction lists and computes the interactions with each box in the lists in terms of multipole expansions. The expansions are then hierarchically distributed to particles and used to update the velocities and positions of the particles. The force computation phase dominates the computation, representing more than 90% of the sequential computation time.

4.2.2 Program Parallelization

Our parallel algorithm is derived from the FMM application in the SPLASH-2 benchmark suite [42].

Concurrency and Synchronization Specification

There are three levels of parallelism in the dominant force computation phase: across interactions corresponding to separate boxes, across computations with different interaction lists for a single box, and across interactions with boxes within the same interaction list. These levels of parallelism are naturally expressed in Concert using conc annotations on the corresponding code in the sequential version, as illustrated by the code below. The global namespace and default object sequential consistency provided by ICC++ ensure that no additional code is required for parallel tasks to access potentially remote boxes in the interaction lists.
In all, there are 126 lines (out of 3,951 lines) in the code corresponding to concurrency specifications.

    conc for (i = 0; i < NumBoxes; i++)
      boxes[i].computeInteractions();
    ...
    computeInteractions() {
      if (type == CHILDLESS)
        conc {
          computeSelfInteraction();
          conc for (i = 0; i < num_u_list; i++)
            UListInteraction(u_list[i]);
          conc for (i = 0; i < num_w_list; i++)
            WListInteraction(w_list[i]);
        }
      conc for (i = 0; i < num_x_list; i++)
        XListInteraction(x_list[i]);
      conc for (i = 0; i < num_v_list; i++)
        VListInteraction(v_list[i]);
    }

Data Locality and Load Balance

FMM exhibits both data-dependent irregular data access patterns and irregular computation granularity. The interaction lists of a box depend on its spatial coordinates, producing irregular data accesses. In addition, the lengths of the interaction lists are highly irregular across boxes, making both data locality and load balancing essential for good performance. Similar to Barnes-Hut, we ensure data locality and load balance using a spatial partitioning of particles and boxes and a history-based load balancing algorithm, adopted from [35]. The spatial partitioning constructs a Peano-Hilbert ordering of all the particles according to their spatial coordinates. This linear ordering is then divided into blocks and assigned to processors. Because the FMM operates on boxes, particles within the same box are assigned to the same node, and each box is assigned to the processor that owns a majority of its sub-boxes, with ties broken arbitrarily. The history-based load balancing counts the number of interactions for each particle and uses these counts as weights to assign the particles and boxes to processors for the next iterative step. Again, the number of interactions in the previous iteration is a good estimate of the amount of future work because particles move slowly during simulations. We use the data placement interface in the Concert system to achieve the assignment of particles to processors. Given the default owner-computes policy of Concert, separate code to distribute the tree handles both load balancing and initial placement. The computation is decoupled from data placement, enabling easy experimentation with different policies. The data placement and load balancing code consists of 37 line changes (out of 3,951 lines).

4.2.3 Summary of Changes

Concurrency and synchronization specification in the FMM code is straightforward and corresponds naturally to the program structure. The non-binding concurrency allows concurrency to be exposed to the compiler without performance penalties, and the global namespace and object consistency support in ICC++ allow the expression of these specifications with minimal changes to the sequential version. Ensuring data locality and load balancing requires understanding of the application. The programmer needs to provide an initial data placement of particles to processors to achieve both good data locality and load balance. However, the data placement is decoupled from the algorithmic specification. In all, less than 5% of the lines require changes for the parallel version: 163 out of 3,951 lines.

4.3 SAMR

In many fields, simulations involve solving hyperbolic partial differential equations, such as the equations of fluid dynamics. The practical task of numerically solving these equations involves discretizing them in both space and time. Structured Adaptive Mesh Refinement (SAMR) methods use dynamically created meshes to discretize such simulations, conserving computational resources.
For our investigation, we have based our parallel method on a sequential C++ code that is in use by the Computational Cosmology Group at the National Center for Supercomputing Applications [4].

4.3.1 Algorithmic Structure

The basic idea in SAMR is to adapt a hierarchy of grids (i.e. meshes) over time, in order to model a physical space efficiently. In this context, hierarchy means a logical arrangement of the grids into a series of levels. Grids at a higher level model with less precision than grids at lower levels. Grids are created and deleted from the hierarchy based on how the simulation is evolving. Lower level (i.e. higher precision) grids are only created when necessary to model rapidly changing areas of the physical space. Thus, the less dynamic regions of the physical space can be modeled coarsely, saving unnecessary computation.

Our implementation uses two primary data structures: a grid-hierarchy data structure and grid data structures. The grid hierarchy represents the state of the current simulation. It is comprised of a series of levels, which are lists of grids. The grid structures are responsible for the actual modeling of the physical space, and as such contain arrays of floating point values that represent state information about the discretized area. Each grid also contains a number of boundary zones, which are used in grid-to-grid communication.

The simulation proceeds by advancing a virtual clock through a number of timesteps, each comprised of a sequence of phases. Phases are either compute phases or grid-to-grid communication. The phases operate over grids or sets of grids at a particular level in the hierarchy. Examples include the compute-intensive numerical solve phases over grids at a level, and the communication-heavy grid-to-grid copy phase over pairs of grids at a level. Global dependences between levels prevent the phases from executing across level boundaries. The algorithm terminates when an appropriate number of iterations have been completed, or when the virtual clock exceeds a predetermined limit.

4.3.2 Program Parallelization

Parallelism in our SAMR implementation resides within each phase of execution, across grids at a particular level in the hierarchy. The numerical solve phase, for example, is independent over grids at a level. The full simulation contains 7 independent phases in total. Thus, the simulation can be executed concurrently by parallelizing each of these phases.

SAMR poses a number of significant challenges for parallelization. Since the method adapts the hierarchy over the course of the simulation's execution, both data structures and parallel work are created dynamically. This means that we have no way of knowing, at compile or initialization time, the extent of the parallel phases. Therefore, we need language support for the expression of dynamic parallelism, and dynamic distribution of data structures at runtime for load balance. Granularity control is another significant challenge. A simple parallelization would exploit parallelism across grids in a level, distributing the grids across the machine. This produces insufficient parallelism and, in general, poor load balance. Our approach overcomes these challenges through system support and careful implementation choices.

Concurrency and Synchronization Specification

The parallelism in the simulation is expressed as a set of parallel method calls over grids, or sets of grids, at a level. Again, levels are adapted at each iteration, so this type of parallelism is dynamic.
Here, we assume that grids at a level have been distributed across processing elements. Using the conc keyword, we can easily express such parallelism as follows.

    conc for( grids G at this level ) {
      G.DoWork();
    }
    ...
    Grid::DoWork() {
      // Independent work on Grids
    }

This use of conc allows each of the DoWork calls to execute independently, synchronizing at the end of the loop. Concert's runtime support automatically distributes the DoWork calls, and the concurrent execution of the loop happens transparently.

Granularity Control and Load Balance

Issues of load balance and grain size control are addressed by tiling. We divide each grid into a set of independent uniform-sized tiles. From the simulation's point of view, nothing has changed; it is still manipulating a hierarchy of grids. However, the logical transformation of grids into an array of independent tile objects provides both load balancing and granularity control. A grid now becomes a container of tiles, which are distributed via the ICC++ distributed array data type, collections. We use conc as simply as before the transformation, but the concurrent loop is now doubly nested.

    conc for( grids G at this level ) {
      conc for( tiles T in grid G ) {
        T.DoWork();
      }
    }
    ...
    Tile::DoWork() {
      // Independent work on Tiles
    }

This construction still allows the DoWork calls to be independent method calls over distributed objects. However, we can now control the granularity of the independent pieces of work by adjusting the size of the individual tiles. Moreover, the adjustments to grain size do not affect the size of grids from the simulation's point of view. This transformation also promotes load balance, as the size of a unit of work is generally proportional to tile size, which is made uniform.

4.3.3 Summary of Changes

The complexity involved in parallelizing our SAMR simulation method was modest in terms of programmer effort. The first task was to code the tiling data structure transformation. While the modifications required were far-reaching throughout the code, they were not overly complex, because the tile structures closely resembled the grid data structures. Thus, mainly the interfaces of former grid functions had to be modified, and not entire bodies of functions. The next step was instrumenting the code to run concurrently. This largely consisted of adding the conc keyword to a number of loops, and implementing grids as collections of tiles. Due to ICC++'s similarity to C++, this took very little programmer effort once the parallel loops were identified. To annotate for concurrent execution, about 163 out of 6,303 lines (3%) were modified. Note that in the case of SAMR, no additional load balance and distribution effort was needed on the part of the programmer. The collection data type ensures cyclic distribution, and concurrent method calls with synchronization happen transparently in concurrent loops.

4.4 Polyover

Polyover computes the overlay of two polygon maps. As our starting point we use a fast sequential algorithm described in [41].

4.4.1 Algorithmic Structure

The two polygon maps, referred to as the left and right map, are represented as a vector and a linked list respectively. Parallelism is across all polygons in the left map, with each polygon in the left map being intersected in turn with each polygon in the right map. The intersection procedure is made efficient by keeping track of polygon overlap areas. The procedure terminates for each polygon when it is fully covered.
In addition, a polygon is pruned from the right map list when it has been entirely covered by one or more left map polygons. These optimizations produce irregular concurrency and tricky sharing of dynamic data structures.

Polyover poses two main implementation challenges. First, the primary data structure is a globally shared linked-list implementation of the right map that is concurrently traversed and modified. Second, the order-sensitivity of the polygon sizes and comparisons produces challenging load-balance problems.

4.4.2 Program Parallelization

Our parallel algorithm exploits data parallelism across the polygons in the left map, and pipeline parallelism amongst those in the right map.

Concurrency and Synchronization Specification

To parallelize Polyover, a programmer must provide concurrency annotations for the key inner loops (the polygon comparisons and area computations). Such annotation is shown in the code example below; only two conc annotations are required. The global namespace and the object-level sequential consistency in ICC++ permit a natural expression of the code and implicitly maintain consistency of the list. Overall, the code required changes in 12 lines (out of 1,695).

    rightList = polyVec2AreaLn(right, TRUE);
    conc for (i = 0; i < ...; i++) {
      ...
      while ((pl->area > 0) && ((rightCurr = rightPre->link) != NULL)) {
        if ((newLn = polyOverlayLn(pl, rightCurr->poly)) != NULL) {
          // add new polygon
        }
        if (rightCurr->area == 0)
          rightPre->link = rightCurr->link;   // delete from list
        else
          rightPre = rightCurr;               // traverse next
      }
    }

Data Locality and Load Balance

To implement Polyover efficiently in ICC++, the programmer must also control the data layout. The challenge is to maintain reasonable load balance and efficiency (because the computation is extremely fine-grained). We solve this problem by distributing the list data structure according to a block distribution. The size of the block ensures high efficiency at each node. Reasonable load balance was achieved by relying on a degree of randomness in the inputs, so that good load balance occurs despite the irregularity of the computation. An alternative approach would be to use a block-cyclic distribution, as the number of polygons is often quite large. Expressing the block distribution was straightforward, as it is completely orthogonal to the expression of the program. The data distribution for the right map is a simple annotation. Overall, the program required changes to 69 lines for data locality and load balance.

4.4.3 Summary of Changes

The overall changes required for parallelization are quite small. Only a few conc annotations provided the requisite concurrency. The computation specification is completely decoupled from the data placement, permitting easy experimentation with placement policies. Overall, less than 5% of the lines (81 out of 1,695) require changes from the sequential version. The global address space is a clear improvement over a distributed namespace, as a distributed model would require explicit name management of list elements. The programming was easier than on a shared-memory system, as the programmer did not need to do any explicit locking.

4.5 Gröbner

The Gröbner application is from the symbolic algebra domain and computes the Gröbner basis of a set of polynomials. We use a sequential algorithm due to Buchberger [5] as our starting point.

4.5.1 Algorithmic Structure

The algorithm starts off with an initial basis set of polynomials equal to the input set. Then, each possible pair of polynomials is evaluated, ranked according to a heuristic metric.
Evaluation of a polynomial pair (which consists of arbitrary-precision arithmetic operations to isolate irreducible polynomial terms) can potentially lead to the addition of a new polynomial to the basis. This, in turn, generates additional polynomial pairs for evaluation, which correspond to the new polynomial paired with the existing basis polynomials. The algorithm completes when there are no polynomial pairs left for evaluation. The algorithm requires data structures that can represent arbitrary-precision arithmetic on polynomials, as well as structures that can dynamically grow to store the basis. In addition, it requires iterative control structures with data-dependent termination criteria. Thus, the algorithm is challenging to express even on sequential platforms.

The algorithm implementation uses two classes: polynomials and the basis. Polynomial objects store coefficients, exponents, and terms of the polynomials being manipulated. The basis object implements a multiset abstraction and stores pointers to the different polynomials (using an array which expands as required). Polynomial pairs are stored in a priority queue (required due to the heuristic evaluation metric): computation consists of dequeuing the next pair from the priority queue and evaluating it until the queue is empty.

4.5.2 Program Parallelization

Our parallel algorithm is derived from a previous parallelization of the Gröbner basis problem on distributed memory machines by Chakrabarti [6].

Concurrency and Synchronization Specification

The parallelism in the application arises from the parallel evaluation of polynomial pairs. This is naturally expressed in Concert using a conc loop around the pair generation code in the sequential program. Each pair creates a dynamic task which accesses both the basis and polynomial objects. A task that produces a polynomial with irreducible terms invokes a method on the basis object to augment it. Support for a global namespace in Concert ensures that no additional code is required for these tasks to access potentially remote objects (either the polynomials or the basis); thus, tasks can use code identical to the sequential methods. The only synchronization specification required by the application is that the basis object be kept consistent under concurrent access, i.e., the basis must be accessed and augmented within a critical section. This specification is trivially ensured in Concert due to the object-level concurrency control semantics. The natural expression of concurrency and synchronization specifications only requires changes at 5 lines in the program (out of 1,919 ICC++ lines).

Data Locality and Load Balance

This application exhibits data-dependent irregularity in how long each pair evaluation takes. Additionally, the work varies with the order of pair evaluations (which affects how the basis grows). Consequently, careful management of data locality and load balance is required for efficiency. Data locality ensures that the pair tasks have efficient access to the basis and its polynomials. For load balancing reasons, pairs need to be evaluated on arbitrary processors, requiring remote access to polynomial and basis objects. We use demand-driven caching of both the polynomial and basis objects. The Concert implementation optimizes the caching performance by utilizing additional information about relaxed data consistency semantics.
The additional information conveys the fact that polynomials are accessed in a read-only fashion, and that the computation can proceed with an inconsistent (out-of-date) view of the basis. The latter permits the overlap of pair evaluations with basis augmentations, and is expressed as shown in the following code segment. Over the entire program, the programmer needs to specify access information about 2 kinds of views, which requires changes to 4 lines in the program.

    processPair( pOne, pTwo ) {
      spol = spol( pOne, pTwo );
      /* can use an inconsistent copy of global Basis */
      basisp = &(Basis);
      with_local( basisp, READ_ASYNC ) {
        ... = basisp->setReduce( spol );
      }
    }

Load balance is required because of the irregular computation granularity. This must be balanced against the demand of ordering pair evaluations according to a heuristic metric, which is crucial to avoid redundant work. Our solution accomplishes these goals by attaching the pair evaluation task to a priority-based work-stealing scheduler (provided as part of the Concert system), as shown in the code fragment below. The code illustrates the three aspects of programmer input required for load balancing this application: the programmer needs to identify the task granularity for load balancing (in this case, the processPair function), supply an integer priority for the task (using the !priority keyword), and select a scheduler for enqueuing the task (in this case, the GLOBAL_PRIORITY scheduler). Similar custom load-balancing annotations were required at 2 lines in the program.

    ...
    /* pNew is being added to the Basis;
       generate work by iterating over existing polynomials */
    conc for ( ... ) {
      pOld = ...
      if ( (pOld != pNew) && test(pOld, pNew) ) {
        priority = calculate_priority( pOld, pNew );
        ( processPair( pOld, pNew ) !priority priority !scheduler GLOBAL_PRIORITY );
      }
    }
    ...

4.5.3 Summary of Changes

Concurrency and synchronization specification in the Gröbner application requires very little programmer effort, corresponding naturally to the structure of the sequential algorithm. Furthermore, the high-level features of the Concert system (non-binding concurrency, global namespace, and object-level concurrency) allow these specifications to be expressed as minimal changes to the original sequential source. On the other hand, ensuring locality and load balance requires more effort, primarily due to the dynamic, irregular nature of the application. Programmer effort was required for two things: providing relaxed data consistency specifications for ensuring efficient data locality of polynomials and the basis, and specifying thread ordering and scheduling for load balance by attaching the processPair task to a priority-based scheduler. In each case, the Concert system provides support which requires minimal changes to source code, allowing the programmer to focus on high-level structural issues. In all, less than 1% of the code required modification (11 out of 1,919 lines) for describing the concurrency and synchronization specifications (5 lines), and for ensuring data locality (4 lines) and load balance (2 lines).

4.6 Phylogeny

The Phylogeny application computes the evolutionary history of a set of species, representing the history as a tree in which the path from the root to a species shows the evolutionary path of that species. We use a sequential algorithm described by Jones [20] as our starting point.

4.6.1 Algorithmic Structure

Phylogenetic trees are constructed by considering the characters exhibited by the species.
Given a set of species, with each species u represented as a vector of character values u[1], ..., u[cmax] (cmax is the maximum number of characters to be considered), a character is compatible with a phylogenetic tree if no value for that character arises more than once in any path in the tree. A perfect phylogenetic tree for a set of species and a given set of characters is a phylogenetic tree compatible with all the characters. Our algorithm for the general phylogeny problem is based on a method known as character compatibility and uses a solution to the perfect phylogeny problem (determining whether or not a perfect phylogenetic tree exists; see [20] for details) as a subroutine.

The essence of the character compatibility method is a search for the largest compatible subset of characters, the rationale being that if the subset is large enough, the corresponding perfect phylogeny will be a good estimate of the evolutionary history of the species. Thus, the algorithm searches over the power set of characters (each possible subset), asking whether or not each subset is compatible. The algorithm relies on extensive pruning, based upon the property that the subsets of a compatible set are also compatible. The algorithm maintains a failure set of subsets. Computation corresponds to a bottom-up search of the character lattice (depth-first and right-to-left, beginning with small subsets and progressing to larger subsets). Each subset is first checked against the failure set; a subset that contains a previously failed subset does not require the perfect phylogeny procedure, otherwise the perfect phylogeny procedure is called. Based on the result of the procedure, either the subset is added to the failure set, or additional subsets are generated for exploration.

The implementation of the algorithm uses one kind of object, a solution node, which is organized as a trie to implement the failure set (to facilitate efficient subset checking). Each solution node maintains a character subset, and child and parent pointers of the trie. Given the sophisticated nature of the data structures and the pruning-based search procedure, the algorithm is complicated to express even on sequential platforms.

4.6.2 Program Parallelization

Our parallel algorithm follows the strategy of [20] and evaluates multiple character subsets in parallel. Given that adequate parallelism exists at this level, we have chosen to utilize a bundled solution to the parallel phylogeny problem as a (sequential) leaf subroutine.

Concurrency and Synchronization Specification

The parallelism in the application is naturally expressed in Concert by annotating the work generation loop with the conc keyword. The non-binding concurrency semantics of Concert permit all the parallelism to be expressed essentially without concern for the computation granularity of each task. Each task traverses the global failure set. Support for a global namespace in Concert ensures that no additional code is required for tasks to access potentially remote solution nodes; thus, tasks can use code identical to the sequential methods. The only synchronization specification required by the application is that the failure set be kept consistent under concurrent access, i.e., the failure set must be accessed and augmented within a critical section. This specification is trivially ensured in Concert (in fact a weaker form of the specification is implemented) due to the object-level concurrency control semantics associated with the solution node objects.
The natural expression of concurrency and synchronization only requires changes to 4 lines in the program (out of 2,305 lines).

Data Locality and Load Balance

This application exhibits data-dependent irregularity in how long each subset evaluation takes. Additionally, the work varies with the order of subset evaluations (which affects how the failure set grows). Consequently, careful management of data locality and load balance is required for efficiency. Data locality ensures that subset tasks have efficient access to the solution nodes comprising the failure set. For load balancing reasons, subsets need to be evaluated on arbitrary processors and may require remote access to solution node objects. We use demand-driven caching of solution node objects. The Concert implementation optimizes the caching performance by utilizing additional information about relaxed data consistency semantics. The additional information conveys the fact that subset tasks can proceed with an inconsistent (out-of-date) view of the failure set. This permits overlap of subset evaluations with failure set augmentation, and is illustrated in the following code segment. Over the entire program, the programmer needs to specify access information about 1 kind of view, which requires changes to 10 lines in the program.

    SolnNode::SupersetOfSoln( elt, cset ) {
      if ( cset.member(elt) ) {
        ...
        /* recursively traverse children */
        with_local( left, READ_ASYNC, right, READ_ASYNC ) {
          ... = left->SupersetOfSoln( elt+1, cset );
          ... = right->SupersetOfSoln( elt+1, cset );
        }
      }
    }

Load balance is required because of the irregular computation granularity. This must be balanced against the demand of ordering subset evaluations to correspond to a bottom-up traversal of the character lattice, crucial to avoiding redundant calls to the perfect phylogeny routine. Our solution accomplishes these goals by attaching the subset task to a priority-based work-stealing scheduler. As with the Gröbner application, programmer effort is required to identify the task granularity for load balancing (the subset task), supply an integer priority (based on a bit-encoding of the subset), and select a scheduler (in this case, the LOCAL_PRIORITY scheduler). Such custom load-balancing annotations were required at 2 places in the program.

4.6.3 Summary of Changes

Concurrency and synchronization specification in the Phylogeny application requires very little programmer effort, corresponding naturally to the structure of the sequential algorithm. Furthermore, the high-level features of the Concert system (non-binding concurrency, global namespace, and object-level concurrency) allow these specifications to be expressed as minimal changes to the original sequential source. On the other hand, ensuring locality and load balance requires more effort, primarily due to the dynamic, irregular nature of the application. Programmer effort was required for two things: providing relaxed data consistency specifications for ensuring efficient data locality of solution nodes, and specifying thread ordering and scheduling for load balance by attaching the subset task to a priority-based scheduler. In each case, the Concert system provides support which requires minimal changes to source code, allowing the programmer to focus on high-level structural issues. In all, less than 1% of the code required modification (16 out of 2,305 lines) for describing the concurrency and synchronization specifications (4 lines), and for ensuring data locality (10 lines) and load balance (2 lines).
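To illustrate the pruning property that drives the search (subsets of a compatible set are also compatible, so any candidate containing a failed subset can be skipped), the following plain C++ sketch represents character subsets as bitmasks and keeps the failed subsets in a flat list. The paper's implementation instead stores them in a trie of solution nodes for faster checking, and all names here (CharSet, FailureSet, perfect_phylogeny, and so on) are hypothetical.

    #include <cstdint>
    #include <vector>

    // Bit c set means character c is in the subset.
    using CharSet = std::uint64_t;

    struct FailureSet {
        std::vector<CharSet> failed;

        // A candidate can be pruned if it contains some subset that has
        // already failed (supersets of incompatible sets are incompatible).
        bool contains_failed_subset(CharSet candidate) const {
            for (CharSet f : failed)
                if ((candidate & f) == f) return true;
            return false;
        }

        void add(CharSet failed_subset) { failed.push_back(failed_subset); }
    };

    // Driver fragment: evaluate a candidate subset only when it survives the
    // pruning check; perfect_phylogeny() stands in for the leaf routine.
    bool evaluate(FailureSet& fs, CharSet candidate,
                  bool (*perfect_phylogeny)(CharSet)) {
        if (fs.contains_failed_subset(candidate)) return false;  // pruned
        bool compatible = perfect_phylogeny(candidate);
        if (!compatible) fs.add(candidate);
        return compatible;
    }

An integer encoding of the candidate subset along these lines is also the kind of value that could supply the !priority argument mentioned above.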
4.7 Radiosity

The radiosity method computes the global illumination in a scene containing diffusely reflecting surfaces by expressing the radiosity of an elemental surface patch as a linear function of the radiosities of all other patches, weighted by the distance and occlusion between the patches. As our starting point, we use a sequential algorithm due to Hanrahan [17], modeled after hierarchical N-body methods.

4.7.1 Algorithmic Structure

Traditional radiosity approaches first subdivide the polygonal surfaces that describe a scene into small elemental patches with roughly uniform radiosity and then compute the radiosity of a patch as the aggregation of pair-wise interactions between the patch and every other patch. In contrast, the hierarchical algorithm starts off with the initial patches comprising the scene and computes light transport between pairs of patches, hierarchically subdividing each patch as needed to ensure accuracy. This on-demand division of patches reduces the complexity of the sequential algorithm from O(N^2) to O(N). However, the complexity reduction comes at a cost: the new algorithm involves more sophisticated data structures (e.g., a quad-tree to maintain the subdivision structure) and more complex control structures (e.g., recursion over the child patches and data-dependent iteration). Thus, the algorithm is challenging to express even on sequential platforms.

The algorithm implementation uses two objects: patches and interactions. Patches represent discrete elements of the diffusely reflecting surface, while interactions store information about the visibility and "formfactor" of patch pairs. (The formfactor determines the contribution of a patch's radiosity to other patches and depends upon the shape, distance, and relative orientation of the two patches.) These objects are organized into three principal data structures: a quad-tree of patches, a list of interactions, and a binary space partitioning (BSP) tree of the input patches (a minimal sketch of these structures is shown below). Each patch maintains an interaction list of potentially visible neighbors (initially every other patch). Computation proceeds in iterations, where for each patch, we compute its radiosity as a linear combination of the radiosities of all patches on its interaction list, subdividing it or the other patches hierarchically as required to ensure accuracy (based upon an error computation). Radiosity computation requires that the visibility and formfactor of each interaction be previously calculated. Formfactor calculation depends entirely on the patch geometries, while the visibility calculation requires ray tracing. The visibility calculation dominates the computation time, and is made efficient using the BSP tree. Subdivided patches are organized as a quad-tree, acquire their own interaction lists, and contribute a weighted term to their parent's radiosity. The iterations terminate when the change in the total radiosity (weighted sum over all input patches) falls below a threshold.

4.7.2 Program Parallelization

Our parallel algorithm is derived from the radiosity application in the SPLASH-2 benchmark suite [42].

Concurrency and Synchronization Specification

There are three levels of parallelism in each iteration: across all input patches (radiosity calculation), across child patches of a subdivided patch (radiosity calculation), and across neighbor patches stored in the interaction list (visibility, error, and radiosity calculation).
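Before describing how these levels are expressed in Concert, a minimal sequential C++ sketch of the three principal data structures may help fix the picture. The type and field names (Patch, Interaction, BspNode) are illustrative assumptions, not the names used in the ICC++ program.

    // A minimal sequential sketch of the radiosity data structures; all names
    // are illustrative.
    #include <array>
    #include <memory>
    #include <vector>

    struct Patch;  // forward declaration

    // One candidate light-transport pair: caches the visibility and formfactor
    // between the owning (source) patch and a potentially remote neighbor patch.
    struct Interaction {
        Patch* neighbor = nullptr;
        double formfactor = 0.0;
        double visibility = -1.0;   // -1 encodes "not yet computed"
    };

    // A surface element; subdivided on demand (quad-tree) when the error
    // estimate for an interaction is too large.
    struct Patch {
        double radiosity = 0.0;
        std::array<std::unique_ptr<Patch>, 4> children;  // quad-tree subdivision
        std::vector<Interaction> interactions;           // potentially visible neighbors
    };

    // Node of the binary space partitioning (BSP) tree used to accelerate the
    // ray-tracing visibility queries; built once over the input patches.
    struct BspNode {
        std::vector<Patch*> patches;          // input patches in this cell
        std::unique_ptr<BspNode> front, back; // half-spaces on either side of the plane
    };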
These three levels are naturally expressed in Concert using conc annotations on the corresponding functions in the sequential program. Support of a global namespace in ICC++ ensures that no additional code is required for parallel tasks to access potentially remote objects (e.g., in tree traversal); tasks can use code identical to the sequential methods. Two synchronization specifications are required to ensure correct execution. The first specifies a dependence between tasks: the radiosity calculation of a child patch must be nested within the radiosity calculation of the parent patch. The second specifies that each task accessing a patch object has exclusive access. Both specifications are trivially ensured in Concert: the former by using the same code body in the sequential and parallel versions, and the latter due to the object-level concurrency control semantics. In fact, these specifications are required in only 7 places in the program (out of 3144 lines).

Data Locality and Load Balance

This application exhibits data-dependent irregularity at each of the levels of parallelism: in how deeply a patch may be refined, in how many interactions need to be computed for each patch, and even in how long a particular interaction calculation takes. Additionally, work across the iterations is not uniform, with more than 90% of the computation occurring in the first iteration. Consequently, careful management of data locality and load balance is required for efficiency.

Data locality is an issue primarily for the radiosity and visibility tasks. The radiosity computation on each patch accesses its own state as well as, for each neighboring patch, the neighbor's state and the state stored in the interaction object. The visibility task not only accesses the state of the pair of patches, it also traverses the BSP tree in a data-dependent fashion. We ensure data locality using a combination of data alignment and demand-driven object caching. Locality of interaction objects and the BSP tree is ensured by data alignment. The interaction objects are aligned with the source patch, ensuring locality for computations executing local to the source patch. We use the fact that the BSP tree is built once at the beginning of the computation to explicitly replicate it on each processor. However, patch locality requires demand-driven caching. The Concert implementation optimizes the caching performance by utilizing additional information about relaxed data consistency semantics. As the simplified visibility calculation code below shows, the programmer uses the with local annotation to specify an application-specific access semantics, a view [22], to the compiler and runtime system. In this case, both input polygonal patches are specified to be read-only, allowing the Concert system to efficiently cache the data and exploit data reuse. Over the entire program, the programmer needs to specify access information about 3 kinds of views, and ensuring data locality increases the source code length by 33 lines (compared to the original 3144 lines): the bulk (25 lines) implements the replication of the BSP structure, and the rest describes the alignment and data consistency specifications.

    computeVisibility(spatch, dpatch, cur) {
      with local( spatch, READ ONLY, dpatch, READ ONLY ) {
        /* compute visibility here */
        vis = func( spatch->pos, dpatch->pos, ... );
      }
      cur->vis = vis;
    }

Load balance is primarily required to balance the radiosity and visibility tasks across the machine, and is achieved using a combination of data distribution and dynamic load balancing. Allocating dynamically created patches on random nodes at creation time (the default policy in Concert), and executing radiosity tasks local to them, ensures that the radiosity calculation is load balanced. However, a similar scheme introduces significant imbalance for the visibility tasks. To load balance the latter, we utilize a sender-initiated dynamic load-balancing algorithm which requires the programmer to identify the task which must be balanced and to specify its target destination. In the code fragment below, the programmer chooses the computeVisibility function as the granularity for load balancing, and uses the !target keyword to place the task on a random destination node upon creation. These target annotations were required at 3 places in the program.

    conc for ( cur = start_interaction_list; cur; cur = cur->next ) {
      if ( cur->vis == VISIBILITY_UNDEF )
        ( computeVisibility( this, cur->dest, cur ) !target rand() );
    }

4.7.3 Summary of Changes

Concurrency and synchronization specification in the Radiosity application requires little programmer effort, corresponding naturally to the structure of the sequential algorithm. Furthermore, the high-level features of the Concert system (non-binding concurrency, global namespace, and object-level concurrency) allow these specifications to be expressed as minimal changes to the original sequential source. On the other hand, ensuring locality and load balance requires more effort on the part of the programmer, primarily due to the very irregular nature of the application. Programmer input was required for three things: replication of the BSP tree (a structural change to the program), data consistency semantics for efficient patch caching, and load balancing of the visibility tasks. In each case, the Concert system provides support which requires only minimal changes to source code, allowing the programmer to focus on high-level structural issues. In all, around 2.5% of the code required modification (43 out of 3144 lines): 7 lines for the concurrency and synchronization specifications, 33 lines for data locality, and 3 lines for load balance.

4.8 Summary

All of the above applications use globally shared, pointer-based data structures, and most have irregular concurrency structures. The global namespace and conc annotations of ICC++ enable these computations to be expressed naturally. The application suite falls into two main classes. Applications such as Polyover, Barnes-Hut, FMM, and SAMR have irregular concurrency with either semi-static or random load balancing schemes. To efficiently implement these applications in ICC++, the programmer needs only to provide concurrency annotations to the sequential version of the algorithms, along with load balancing and data placement code, which are expressed together using collection maps. As a result, fewer than 5% of the lines typically require changes: 81 out of 1695 for Polyover, 107 out of 2631 lines for Barnes-Hut, 163 out of 3951 lines for FMM, and 163 out of 6303 lines for SAMR. On the other hand, applications such as Radiosity, Gröbner, and Phylogeny additionally require fully dynamic load balancing and application-specific priority-based scheduling and consistency models for efficiency.
For these applications, the programmer must also provide dynamic load balancing policies and, additionally, priority information and application-specific data access semantics. This requires more intimate knowledge of the applications. However, Concert provides high-level annotations that allow them to be incorporated into the computation with very small code changes: 11 out of 1919 lines for Gröbner, 16 out of 2305 for Phylogeny, and 43 out of 3144 for Radiosity.

5 Parallel Performance

The high-level language features that simplify program expression are commonly thought to imply some performance degradation. However, the aggressive compiler and runtime technology of Concert delivered good performance on our application suite. We discuss performance based on the Cray T3D implementation of the Illinois Concert System (except for the SAMR code, whose performance results are on the SGI Origin system).

All programs achieve good sequential performance (within a factor of 2 of corresponding C programs). This is achieved using aggressive interprocedural analysis in the Concert compiler and optimizations such as method and object inlining to eliminate the overheads of object orientation. Access-region merging eliminates the overhead of the global object space, and speculative stack-heap execution reduces the overheads of the implicit, non-binding concurrency specification. Our results show that aggressive implementation technology can help high-level programming approaches realize their potential, eliminating programmer management of concerns such as procedure and thread granularity, and object structuring, otherwise required for performance.

Figure 1 shows the speedup of the applications with respect to the non-overhead portion of the single-node execution time. These speedups compare favorably with the best speedups reported elsewhere for hand-optimized codes [6, 34, 36, 18, 35]. Polyover's speedup is measured against the best sequential algorithm; it levels off because of insufficient work in the parallel algorithm, but is comparable to other reported numbers [41]. Both the Barnes speedup of 42 and the FMM speedup of 54 on 64 nodes are competitive with shared-memory versions [42] and hand-optimized versions [38] reported elsewhere in the literature. The Radiosity speedup of 23 on 32 T3D processors compares well with the previously reported speedup of 26 on 32 processors of the DASH machine [36], despite DASH's hardware support for cache-coherent shared memory and an order of magnitude faster communication (in terms of processor clocks), which better facilitate scalable performance. Finally, the speedups for both Gröbner and Phylogeny compare favorably with other implementations [20, 6], despite not fully replicating data structures as in the latter cases.

The above competitive speedups were made possible by the various transformations described in Section 4 to enhance data locality and load balance. These transformations were all realizable with modest code changes due to the orthogonal framework for functionality and data placement. We illustrate the incremental performance gains due to each transformation by considering one application, Radiosity, as an example. Figure 2 shows the speedup for three versions of the Radiosity program: the base version corresponds to the base programming model (random data distribution, threads execute local to the target object), the locality version adds the program transformations for data locality, and the load-balance version adds the program transformations for load balance.
As described in Section 4.7, both the locality and load-balance versions are realizable with changes to less than 2.5% of the source code. The figure clearly shows the importance of the data locality and load-balance transformations, which improve speedup on 32 nodes from 4.78 to 19.4 and 23.1 respectively, and on 64 nodes from 5.12 to 31.2 and 38.2 respectively.

Figure 1: Overall speedup of the application suite on the Cray T3D (speedup versus number of processors for Polyover, Barnes, FMM, SAMR, Radiosity, Gröbner, and Phylogeny).

Figure 2: Incremental performance benefits from data locality and load-balance program transformations for the Radiosity application (base, +locality, +load-balance, and ideal).

6 Discussion and Related Work

Despite the wealth of parallel programming research, we know of few application-based studies that systematically evaluate the "programmability" delivered by a parallel language or system. To our knowledge, this is the only study that examines a collection of irregular application programs on a general-purpose parallel programming system to evaluate its support for high-level expression and performance optimization. We find the outcome encouraging. In the Concert system, much of the difficulty of managing irregular concurrency, modifying a program's data distribution, and so on has been eliminated. If commercial parallel programming environments can incorporate similar technologies, the prospects for high performance through high-level parallel object-oriented programming are encouraging.

Programming techniques for irregular applications form an important body of related work. All of these techniques pursue a library-based approach, and often consider irregularity in only a single dimension. The inspector-executor approach supports iterative irregular communication (such as in finite element simulations) [19], but because it is designed for SPMD execution, it provides little support for irregular control flow. Libraries such as Sparspak, Kelp, and A++/P++ support particular classes of algorithms on sparse matrices and adaptive meshes [11, 27], but are basically solver libraries, not a general programming interface for irregular applications. The closest related work is Yelick's Multipol libraries, which provide a set of tools to build irregular applications. However, these libraries provide no uniform high-level interface, perhaps because the project did not include optimizing compiler support [39].

The best known parallel programming approaches are message passing, embodied in MPI [26], and data parallelism, embodied in HPF [13]. While both of these approaches are widely used and could be used to build these irregular applications, we believe neither provides appropriate support. Compared to message passing, ICC++ dramatically simplifies the programming of irregular codes by eliminating the need for explicit name management, reducing programming effort and allowing data locality and load balance to be optimized independently. While there are a few message passing versions of Barnes-Hut, FMM, SAMR, and Radiosity, they are considered to require heroic programming, chiefly due to the difficulties of implementing sophisticated distributed data structures.
Data-parallel approaches [13, 24, 2] provide a global namespace but require structured data parallelism, often a poor match for irregular applications. The alignment adds a significant task to program formulation, and generally forces the programmer to build structures quite different from the sequential program versions. This increases the programming effort. More flexible models of data parallelism, such as nested parallelism [3, 8], are more promising.

Another body of related work is a series of application studies for cache-coherent shared memory systems, which employ a model of a shared address space and threads [42, 37]. While we have leveraged these studies, borrowing the algorithmic structure and the data locality and load balance optimizations for Barnes-Hut, FMM, and Radiosity, the real goal of those studies is architectural evaluation, not programmability. In contrast, the Concert system allows these applications to be programmed at a much higher level yet still achieve excellent scalability. For example, the aggressive compiler optimization in Concert frees programmers from worries about procedure granularity. In addition, Concert's lightweight multithreading eases concerns about over-parallelization, making the expression of sufficient parallelism much easier. Finally, Concert employs both a shared object namespace (perhaps easier to implement) and flexible object consistency (which is implemented in software but could be implemented in hardware). While the Concert system's aggressive implementation techniques enable high performance implementations for the seven irregular applications on both distributed and shared memory systems, it is unclear which hardware systems provide the best performance. Although shared-memory machines simplify the bookkeeping for implementing a global namespace, they also introduce several new problems, such as inflexible coherence (sharing) granularity and protocols, and opaque memory behavior that is more difficult to reason about and optimize. One interesting future direction is performance comparisons across such platforms.

7 Summary

Based on our application studies, we must conclude that while the development of irregular parallel applications remains challenging, the Concert system enables such programs to be written with a high-level approach. The high-level expression actually enhances program optimization, both for the compiler and in dealing with the fundamental challenges of data locality and load balance for irregular applications. In virtually all cases, the complexity of the data locality and load balance scheme was significant and in many ways comparable to the application's algorithmic complexity. The clean separation of data locality issues from naming is a clear benefit, improving program flexibility dramatically. In addition, the whole-program optimizations of the Concert system enable programmers to relegate a number of concerns to the programming system (procedure and computation granularity, namespace management, and low-level concurrency management). While these optimizations may be too expensive to incorporate in production compilers today, we are optimistic that algorithmic improvements and good approximations will make such optimizations a fixture in future parallel programming systems. Finally, given appropriate high-level programming support, the optimized parallel versions of the irregular applications had basic program structure identical to their sequential counterparts.
Concurrency was added with small program perturbation, and data locality and load management were achieved with minor program changes. In fact, in all cases, less than 5% of the original source lines needed to be modified to deliver high parallel performance. This bodes well for software developers who inevitably need to support a single code base for sequential and parallel users.

7.1 Future Work

While our results are encouraging, concurrent object-oriented approaches can be improved in several ways to increase their programmability advantages. First, compiler technology that aggressively propagates locality information for pointer-based data structures has the potential to further reduce the program optimization effort for irregular applications. Second, the flexible concurrency structures expressible in ICC++ require global concurrency analysis in general, an open research problem [25], to enable synchronization optimizations; at present, this also increases the explicit annotations required to achieve high performance, making it another promising direction for research. Third, in several cases, explicit annotations for locality and load balance had to be refined with programmer information on task granularity, object consistency, and priority scheduling. Of these, finding automatic techniques for the first two seems most promising. Finally, while we have looked at seven significant irregular programs, there is still much to be learned from systematic application studies of parallel programming systems. We hope that this study will provide insights to both application programmers and programming system designers, stimulating research on the real problems facing programmers.

Acknowledgments

The authors would like to acknowledge the work of John Plevyak and Hao-Hua Chu on the Illinois Concert System. The research described in this paper was supported in part by DARPA Order #E313 through the US Air Force Rome Laboratory Contract F30602-96-1-0286, NSF grant MIP-92-23732, ONR grants N00014-92-J-1961 and N00014-93-1-1086, NASA grant NAG 1-613, and supercomputing resources at the Jet Propulsion Laboratory. Support from Intel Corporation, Tandem Computers, Hewlett-Packard, and Motorola is also gratefully acknowledged. Andrew Chien is supported in part by NSF Young Investigator Award CCR-94-57809. Vijay Karamcheti was supported in part by an IBM Computer Sciences Fellowship.

References

[1] J. Barnes and P. Hut. A hierarchical O(N log N) force calculation algorithm. Technical report, The Institute for Advanced Study, Princeton, New Jersey, 1986.
[2] Peter Beckman, Dennis Gannon, and Elizabeth Johnson. Portable parallel programming in HPC++. Available online at http://www.extreme.indiana.edu/hpc%2b%2b/docs/ppphpc++/icpp.ps, 1996.
[3] G. Blelloch and G. Sabot. Compiling collection-oriented languages onto massively parallel computers. Journal of Parallel and Distributed Computing, 8(2):119-34, 1990.
[4] Greg L. Bryan. The Numerical Simulation of X-ray Clusters. PhD thesis, University of Illinois at Urbana-Champaign, August 1996.
[5] B. Buchberger. Multidimensional Systems Theory, chapter Gröbner Basis: An Algorithmic Method in Polynomial Ideal Theory, pages 184-232. D. Reidel Publishing Company, 1985.
[6] S. Chakrabarti and K. Yelick. Implementing an irregular application on a distributed memory multiprocessor. In Proceedings of the Fourth ACM/SIGPLAN Symposium on Principles and Practices of Parallel Programming, pages 169-179, May 1993.
[7] K. Mani Chandy and Carl Kesselman. Compositional C++: Compositional parallel programming.
In Proceedings of the Fifth Workshop on Compilers and Languages for Parallel Computing, New Haven, Connecticut, 1992. YALEU/DCS/RR-915, Springer-Verlag Lecture Notes in Computer Science, 1993.
[8] S. Chatterjee. Compiling nested data parallel programs for shared memory multiprocessors. ACM Transactions on Programming Languages and Systems, 15(3), 1993.
[9] A. A. Chien, W. Feng, V. Karamcheti, and J. Plevyak. Techniques for efficient execution of fine-grained concurrent programs. In Proceedings of the Fifth Workshop on Compilers and Languages for Parallel Computing, pages 103-13, New Haven, Connecticut, 1992. YALEU/DCS/RR-915, Springer-Verlag Lecture Notes in Computer Science, 1993.
[10] A. A. Chien, M. Straka, J. Dolby, V. Karamcheti, J. Plevyak, and X. Zhang. A case study in irregular parallel programming. In DIMACS Workshop on Specification of Parallel Algorithms, May 1994. Also available as Springer-Verlag LNCS.
[11] E. Chu, A. George, J. Liu, and E. Ng. Sparspak: Waterloo sparse matrix package user's guide for Sparspak-A. Technical Report CS-84-36, Department of Computer Science, University of Waterloo, Waterloo, Ontario, Canada, 1984.
[12] Julian Dolby. Automatic inline allocation of objects. In Proceedings of the 1997 ACM SIGPLAN Conference on Programming Language Design and Implementation, June 1997.
[13] High Performance Fortran Forum. High Performance Fortran language specification version 1.0. Technical Report CRPC-TR92225, Rice University, January 1993.
[14] Bishwaroop Ganguly, Greg Bryan, Michael Norman, and Andrew Chien. Exploring structured adaptive mesh refinement (SAMR) methods with the Illinois Concert system. In Proceedings of the Eighth SIAM Conference on Parallel Processing for Scientific Computing, Minneapolis, Minnesota, March 1997.
[15] Leslie Greengard. Fast Algorithms for Classical Physics. MIT Press, Cambridge, MA, 1988.
[16] A. Grimshaw. Easy-to-use object-oriented parallel processing with Mentat. IEEE Computer, 5(26):39-51, May 1993.
[17] P. Hanrahan, D. Salzman, and L. Aupperle. A rapid hierarchical radiosity algorithm. Computer Graphics (Proc. SIGGRAPH), 25(4):197-206, July 1991.
[18] Yuan-Shin Hwang, Raja Das, Joel Saltz, Bernard Brooks, and Milan Hodošček. Parallelizing molecular dynamics programs for distributed memory machines. IEEE Computational Science and Engineering, pages 18-29, Summer 1995.
[19] J. H. Saltz, et al. A manual for the CHAOS runtime library. Technical Report CS-TK-3437, Department of Computer Science, University of Maryland, 1995.
[20] Jeff A. Jones. Parallelizing the phylogeny problem. Master's thesis, Computer Science Division, University of California, Berkeley, December 1994.
[21] L. V. Kale and W. Shu. The Chare-Kernel language for parallel programming. In Proceedings of the International Conference on Parallel Processing, 1990.
[22] Vijay Karamcheti and Andrew A. Chien. View caching: Efficient software shared memory for dynamic computations. In Proceedings of the International Parallel Processing Symposium, 1997.
[23] Vijay Karamcheti, John Plevyak, and Andrew A. Chien. Runtime mechanisms for efficient dynamic multithreading. Journal of Parallel and Distributed Computing, 37(1):21-40, 1996. Available from http://www-csag.cs.uiuc.edu/papers/rtperf.ps.
[24] J. Lee and D. Gannon. Object oriented parallel programming. In Proceedings of the ACM/IEEE Conference on Supercomputing. IEEE Computer Society Press, 1991.
[25] Stephen P. Masticola and Barbara G. Ryder. Non-concurrency analysis.
In Proceedings of the Fourth Symposium on Principles and Practice of Parallel Programming, pages 129-138, May 1993.
[26] Message Passing Interface Forum. The MPI message passing interface standard. Technical report, University of Tennessee, Knoxville, April 1994. Available from http://www.mcs.anl.gov/mpi/mpi-report.ps.
[27] R. Parsons and D. Quinlan. A++/P++ array classes for architecture independent finite difference computations. In Proceedings of the Second Annual Object-Oriented Numerics Conference, Sunriver, Oregon, April 1994.
[28] John Plevyak. Optimization of Object-Oriented and Concurrent Programs. PhD thesis, University of Illinois at Urbana-Champaign, Urbana, Illinois, 1996.
[29] John Plevyak and Andrew Chien. Incremental inference of concrete types. Technical Report UIUCDCS-R-93-1829, Department of Computer Science, University of Illinois, Urbana, Illinois, June 1993.
[30] John Plevyak and Andrew A. Chien. Precise concrete type inference of object-oriented programs. In Proceedings of OOPSLA'94, Object-Oriented Programming Systems, Languages and Architectures, pages 324-340, 1994.
[31] John Plevyak and Andrew A. Chien. Type directed cloning for object-oriented programs. In Proceedings of the Workshop for Languages and Compilers for Parallel Computing, pages 566-580, 1995.
[32] John Plevyak, Vijay Karamcheti, and Andrew Chien. Analysis of dynamic structures for efficient parallel execution. In Proceedings of the Sixth Workshop for Languages and Compilers for Parallel Machines, pages 37-56, August 1993.
[33] John Plevyak, Xingbin Zhang, and Andrew A. Chien. Obtaining sequential efficiency in concurrent object-oriented programs. In Proceedings of the ACM Symposium on the Principles of Programming Languages, pages 311-321, January 1995.
[34] Daniel J. Scales and Monica S. Lam. The design and evaluation of a shared object system for distributed memory machines. In First Symposium on Operating Systems Design and Implementation, 1994.
[35] Jaswinder Pal Singh. Parallel Hierarchical N-Body Methods and Their Implications For Multiprocessors. PhD thesis, Stanford University Department of Computer Science, Stanford, CA, February 1993.
[36] Jaswinder Pal Singh, Anoop Gupta, and Marc Levoy. Parallel visualization algorithms: Performance and architectural implications. IEEE Computer, 27(7):45-56, July 1994.
[37] Jaswinder Pal Singh, Wolf-Dietrich Weber, and Anoop Gupta. SPLASH: Stanford parallel applications for shared memory. Technical Report CSL-TR-91-469, Computer Systems Laboratory, Stanford University, Stanford, CA 94305, April 1991. Available from ftp://mojave.stanford.edu/pub/splash/report/splash.ps.
[38] M. Warren and J. Salmon. A parallel hashed oct-tree N-body algorithm. In Proceedings of the Supercomputing Conference, pages 12-21, 1993.
[39] Chih-Po Wen, Soumen Chakrabarti, Etienne Deprit, Arvind Krishnamurthy, and Katherine Yelick. Run-time support for portable distributed data structures. In Third Workshop on Languages, Compilers, and Run-Time Systems for Scalable Computers, pages 111-120, Boston, May 1995. Kluwer Academic Publishers.
[40] Gregory V. Wilson and Paul Lu, editors. Parallel Programming Using C++. MIT Press, 1995.
[41] Gregory V. Wilson and Paul Lu, editors. Parallel Programming Using C++, chapter ICC++. MIT Press, 1995.
[42] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. The SPLASH-2 programs: Characterization and methodological considerations. In Proceedings of the International Symposium on Computer Architecture, pages 24-36, 1995.
[43] Akinori Yonezawa, editor.
ABCL: An Object-Oriented Concurrent System. MIT Press, 1990. ISBN 0-262-24029-7.
[44] Xingbin Zhang and Andrew A. Chien. Dynamic pointer alignment: Tiling and communication optimizations for parallel pointer-based computations. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Las Vegas, Nevada, June 1997.
[45] Xingbin Zhang, Vijay Karamcheti, Tony Ng, and Andrew Chien. Optimizing COOP languages: Study of a protein dynamics program. In IPPS'96, 1996.